LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery #418
Conversation
…eldsQuery
// HighMed: from office # freq=3224339 freq=225338
//   top scores time usage 425 milliseconds
//   complete time usage 401 milliseconds
// HighHigh: but publisher # freq=1456553 freq=1289029
//   top scores time usage 469 milliseconds
//   complete time usage 322 milliseconds
// HighLow: with fung # freq=3709421 freq=1344
//   top scores time usage 241 milliseconds
//   complete time usage 428 milliseconds
// HighLow: date insult # freq=2020626 freq=4424
//   top scores time usage 171 milliseconds
//   complete time usage 225 milliseconds
Perf tests result
Run 1:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 3.53 (3.3%) 2.92 (5.3%) -17.2% ( -25% - -8%) 0.000
PKLookup 108.13 (7.7%) 119.85 (8.2%) 10.8% ( -4% - 28%) 0.000
CFQHighLow 14.88 (3.9%) 16.69 (12.5%) 12.2% ( -3% - 29%) 0.000
CFQHighMed 21.11 (4.1%) 25.87 (11.8%) 22.6% ( 6% - 40%) 0.000
Run 2:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 6.64 (3.1%) 5.63 (10.2%) -15.2% ( -27% - -1%) 0.000
CFQHighLow 8.35 (2.8%) 8.05 (15.0%) -3.6% ( -20% - 14%) 0.297
CFQHighMed 24.51 (5.3%) 24.90 (19.9%) 1.6% ( -22% - 28%) 0.733
PKLookup 110.06 (10.0%) 128.54 (7.9%) 16.8% ( -1% - 38%) 0.000
Run 3:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighMed 13.01 (2.9%) 9.82 (7.8%) -24.5% ( -34% - -14%) 0.000
PKLookup 107.85 (8.1%) 111.79 (11.2%) 3.7% ( -14% - 24%) 0.236
CFQHighHigh 4.83 (2.6%) 5.06 (8.6%) 4.7% ( -6% - 16%) 0.018
CFQHighLow 14.95 (3.0%) 18.31 (19.0%) 22.5% ( 0% - 45%) 0.000
Run 4:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighMed 11.11 (2.9%) 6.69 (4.1%) -39.7% ( -45% - -33%) 0.000
CFQHighLow 27.55 (3.8%) 25.46 (11.0%) -7.6% ( -21% - 7%) 0.003
CFQHighHigh 5.25 (3.2%) 4.96 (6.1%) -5.7% ( -14% - 3%) 0.000
PKLookup 107.61 (6.7%) 121.19 (4.6%) 12.6% ( 1% - 25%) 0.000
Perf tests result for commit 2ba435e:
Run 1:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHighHigh 4.64 (6.5%) 3.30 (4.7%) -29.0% ( -37% - -19%) 0.000
CFQHighHigh 11.09 (6.0%) 9.61 (6.0%) -13.3% ( -23% - -1%) 0.000
PKLookup 103.38 (4.4%) 108.04 (4.3%) 4.5% ( -4% - 13%) 0.001
CFQHighMedLow 10.58 (6.1%) 12.30 (8.7%) 16.2% ( 1% - 33%) 0.000
CFQHighMed 10.70 (7.4%) 15.51 (11.2%) 44.9% ( 24% - 68%) 0.000
CFQHighLowLow 8.18 (8.2%) 12.87 (11.6%) 57.3% ( 34% - 84%) 0.000
CFQHighLow 14.57 (7.5%) 30.81 (15.1%) 111.4% ( 82% - 144%) 0.000
Run 2:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHighHigh 5.33 (5.7%) 4.02 (7.7%) -24.4% ( -35% - -11%) 0.000
CFQHighLowLow 17.14 (6.2%) 13.06 (5.4%) -23.8% ( -33% - -13%) 0.000
CFQHighMed 17.37 (5.8%) 14.38 (7.7%) -17.2% ( -29% - -3%) 0.000
PKLookup 103.57 (5.5%) 108.84 (5.9%) 5.1% ( -6% - 17%) 0.005
CFQHighMedLow 11.25 (7.2%) 12.70 (9.0%) 12.9% ( -3% - 31%) 0.000
CFQHighHigh 5.00 (6.2%) 7.54 (12.1%) 51.0% ( 30% - 73%) 0.000
CFQHighLow 21.60 (5.2%) 34.57 (14.1%) 60.0% ( 38% - 83%) 0.000
Run 3:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHighHigh 5.40 (6.9%) 4.06 (5.1%) -24.8% ( -34% - -13%) 0.000
CFQHighMedLow 7.64 (7.4%) 5.79 (6.3%) -24.2% ( -35% - -11%) 0.000
CFQHighHigh 11.11 (7.0%) 9.60 (5.9%) -13.6% ( -24% - 0%) 0.000
CFQHighLowLow 21.21 (7.6%) 21.22 (6.6%) 0.0% ( -13% - 15%) 0.993
PKLookup 103.15 (5.9%) 107.60 (6.9%) 4.3% ( -8% - 18%) 0.034
CFQHighLow 21.85 (8.1%) 34.18 (13.5%) 56.4% ( 32% - 84%) 0.000
CFQHighMed 12.07 (8.4%) 19.98 (16.7%) 65.5% ( 37% - 98%) 0.000
Run 4:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 8.50 (5.8%) 6.85 (5.2%) -19.5% ( -28% - -8%) 0.000
CFQHighMedLow 10.89 (5.7%) 8.96 (5.4%) -17.8% ( -27% - -7%) 0.000
CFQHighMed 8.41 (5.8%) 7.74 (5.6%) -7.9% ( -18% - 3%) 0.000
CFQHighHighHigh 3.45 (6.7%) 3.38 (5.3%) -2.0% ( -13% - 10%) 0.287
CFQHighLowLow 7.82 (6.4%) 8.20 (7.5%) 4.8% ( -8% - 20%) 0.030
PKLookup 103.50 (5.0%) 110.69 (5.4%) 6.9% ( -3% - 18%) 0.000
CFQHighLow 11.46 (6.0%) 13.16 (6.7%) 14.8% ( 1% - 29%) 0.000
Perf test result from 1a71469
Run 1:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighMed 16.99 (4.9%) 16.84 (14.9%) -0.9% ( -19% - 19%) 0.790
PKLookup 98.49 (8.7%) 121.99 (9.1%) 23.9% ( 5% - 45%) 0.000
CFQHighHigh 5.82 (4.7%) 7.55 (10.0%) 29.8% ( 14% - 46%) 0.000
CFQHighLow 17.59 (5.2%) 25.01 (17.3%) 42.2% ( 18% - 68%) 0.000
Run 2:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighMed 18.28 (6.1%) 15.63 (15.6%) -14.5% ( -34% - 7%) 0.000
PKLookup 101.26 (10.0%) 105.39 (12.0%) 4.1% ( -16% - 28%) 0.243
CFQHighHigh 9.96 (5.6%) 10.47 (13.0%) 5.2% ( -12% - 25%) 0.101
CFQHighLow 9.71 (5.8%) 15.71 (34.0%) 61.8% ( 20% - 107%) 0.000
I re-ran the perf test after mikemccand/luceneutil@0550148 was merged.
For one of the most negatively impacted queries (-42.0%), the profile suggests quite some CPU time is spent on merging impacts within the same field. I suspect this may occur because the max score is computed too frequently, as a frequent term's skip list would be "dense" and is also used to determine upTo for max score (see lucene/lucene/core/src/java/org/apache/lucene/search/ImpactsDISI.java, lines 78 to 82 in 2a9adb8).
Hi @jpountz @jimczi, while taking a deeper dive into how this PR can be improved further, I noticed a difference in the field-weight parameter passed to create:
lucene/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java, lines 420 to 421 in 3b914a4
lucene/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java, lines 387 to 389 in 3b914a4
For lucene/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java, lines 404 to 414 in 3b914a4, … whereas I feel …
For now I have created a spin-off issue https://issues.apache.org/jira/browse/LUCENE-10236 and a separate PR #444 for this. Please let me know if they look good to you.
    }
    return result;
  }
I'm suspecting this may occur when the max score is being computed too frequently, as frequent term's skip list would be "dense" and is also used to determine upTo for max score
That would make sense to me. Maybe update this class so that numLevels / getDocIdUpTo only use the impacts of the term that has the highest weight?
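What that suggestion might look like as a sketch, using simplified stand-in types rather than Lucene's actual Impacts API (the interface and class names below are illustrative assumptions, not the real classes):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal stand-in for the part of Lucene's Impacts API used here.
interface SimpleImpacts {
    int numLevels();
    int getDocIdUpTo(int level);
}

// Skip-list geometry (number of levels and their docId bounds) is driven by a
// single lead field's impacts, instead of aggregating across all fields.
class LeadFieldImpacts {
    private final SimpleImpacts lead;

    LeadFieldImpacts(Map<String, SimpleImpacts> impactsPerField, String leadField) {
        this.lead = impactsPerField.get(leadField);
    }

    int numLevels() {
        // Delegate to the lead field only; other fields' impacts would be
        // approximated elsewhere when merging.
        return lead.numLevels();
    }

    int getDocIdUpTo(int level) {
        return lead.getDocIdUpTo(level);
    }
}
```

The idea is that the lead field's skip structure alone decides when score bounds expire, avoiding the cross-field min/max aggregation discussed above.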
Thanks for the suggestion! I assume by "highest weight" here, you meant the term that has lower doc frequencies, as opposed to field weight?
I also did a quick test with the following updated numLevels / getDocIdUpTo implementations, to approximate using the lower-doc-frequency term's impacts:
@Override
public int numLevels() {
// this is changed from Integer.MIN_VALUE
int result = Integer.MAX_VALUE;
// this is changed from Math.max
for (Impacts impacts : leadingImpactsPerField.values()) {
result = Math.min(result, impacts.numLevels());
}
return result;
}
@Override
public int getDocIdUpTo(int level) {
// this is changed from Integer.MAX_VALUE
int result = Integer.MIN_VALUE;
for (Impacts impacts : leadingImpactsPerField.values()) {
if (impacts.numLevels() > level) {
// this is changed from Math.min
result = Math.max(result, impacts.getDocIdUpTo(level));
}
}
return result;
}
For the slow query CFQHighHigh: at united +combinedFields=titleTokenized^4.0,body^2.0 # freq=2834104 freq=1185528 above, it did improve the perf hit from -42% to -30% ~ -35%, with the following JFR CPU result:
PERCENT CPU SAMPLES STACK
19.41% 11866 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer$MultiFieldNormValues#advanceExact()
7.35% 4491 org.apache.lucene.search.DisiPriorityQueue#downHeap()
6.07% 3713 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
3.59% 2193 org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
3.50% 2143 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#freq()
3.24% 1983 org.apache.lucene.search.DisjunctionDISIApproximation#advance()
3.09% 1889 org.apache.lucene.sandbox.search.CombinedFieldQuery$WeightedDisiWrapper#freq()
2.87% 1752 org.apache.lucene.search.DisiPriorityQueue#top()
2.75% 1681 java.lang.Math#round()
2.71% 1657 org.apache.lucene.search.DisiPriorityQueue#topList()
2.54% 1555 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
2.36% 1441 org.apache.lucene.store.ByteBufferGuard#ensureValid()
2.11% 1292 org.apache.lucene.util.SmallFloat#longToInt4()
1.87% 1142 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#score()
1.73% 1058 org.apache.lucene.search.DisiPriorityQueue#updateTop()
1.71% 1045 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer#getNormValue()
1.68% 1030 org.apache.lucene.store.ByteBufferGuard#getByte()
1.59% 970 jdk.internal.misc.Unsafe#getByte()
1.57% 959 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.55% 948 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.18% 720 org.apache.lucene.search.ImpactsDISI#docID()
0.84% 515 org.apache.lucene.sandbox.search.CombinedFieldQuery$1$1#doMergeImpactsPerField()
0.79% 485 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
0.76% 462 org.apache.lucene.search.DisiPriorityQueue#prepend()
0.73% 444 java.lang.Math#toIntExact()
0.60% 367 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#freq()
0.53% 324 org.apache.lucene.search.ImpactsDISI#advanceTarget()
0.51% 314 org.apache.lucene.codecs.MultiLevelSkipListReader#skipTo()
0.50% 306 org.apache.lucene.util.SmallFloat#intToByte4()
0.49% 299 org.apache.lucene.search.ImpactsDISI#nextDoc()
This CPU profiling result looks very similar to that of the baseline. I'll do more testing to understand why.
Hi @jpountz, I just noticed a place in the implementation that was called multiple times, so I tried to "cache" the result to save the repeated computation (6d5e780). The improvement turned out to be quite significant:
Results from python3 src/python/localrun.py -source combinedFieldsBig
Run 1
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 3.17 (3.5%) 2.70 (4.6%) -14.8% ( -22% - -6%) 0.000
PKLookup 117.15 (6.7%) 116.31 (10.3%) -0.7% ( -16% - 17%) 0.792
CFQHighMed 12.98 (5.2%) 18.26 (13.9%) 40.7% ( 20% - 63%) 0.000
CFQHighLow 17.30 (3.6%) 36.19 (20.1%) 109.2% ( 82% - 137%) 0.000
Run 2
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 104.98 (6.9%) 115.76 (8.5%) 10.3% ( -4% - 27%) 0.000
CFQHighMed 5.73 (3.9%) 6.53 (8.4%) 14.0% ( 1% - 27%) 0.000
CFQHighLow 17.87 (3.4%) 20.37 (8.1%) 14.0% ( 2% - 26%) 0.000
CFQHighHigh 5.41 (4.2%) 6.53 (8.9%) 20.7% ( 7% - 35%) 0.000
Run 3
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 3.18 (3.6%) 2.66 (5.3%) -16.4% ( -24% - -7%) 0.000
CFQHighMed 5.75 (4.0%) 5.07 (5.5%) -11.7% ( -20% - -2%) 0.000
PKLookup 136.87 (5.9%) 133.97 (9.4%) -2.1% ( -16% - 14%) 0.393
CFQHighLow 23.19 (2.9%) 38.74 (16.0%) 67.0% ( 46% - 88%) 0.000
Results from python3 src/python/localrun.py -source combinedFieldsUnevenlyWeightedBig
Run 1
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 102.30 (8.5%) 101.05 (15.1%) -1.2% ( -22% - 24%) 0.753
CFQHighMed 5.73 (6.7%) 6.61 (12.5%) 15.5% ( -3% - 37%) 0.000
CFQHighHigh 5.37 (6.2%) 7.28 (12.7%) 35.6% ( 15% - 58%) 0.000
CFQHighLow 13.61 (6.8%) 27.02 (34.4%) 98.5% ( 53% - 150%) 0.000
Run 2
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighMed 5.63 (4.0%) 4.87 (6.6%) -13.5% ( -23% - -2%) 0.000
PKLookup 95.27 (9.3%) 102.84 (9.0%) 7.9% ( -9% - 28%) 0.006
CFQHighHigh 5.45 (3.7%) 7.43 (10.6%) 36.3% ( 21% - 52%) 0.000
CFQHighLow 13.91 (4.6%) 29.42 (28.2%) 111.5% ( 75% - 151%) 0.000
Run 3
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 74.67 (39.5%) 97.84 (23.3%) 31.0% ( -22% - 155%) 0.002
CFQHighHigh 5.61 (4.5%) 7.62 (10.0%) 35.8% ( 20% - 52%) 0.000
CFQHighMed 10.44 (3.5%) 15.65 (23.8%) 50.0% ( 21% - 80%) 0.000
CFQHighLow 14.00 (4.2%) 25.99 (33.0%) 85.6% ( 46% - 128%) 0.000
However, this optimization doesn't help much for the query CFQHighHigh: at united +combinedFields=titleTokenized^4.0,body^2.0 # freq=2834104 freq=1185528, as the dense skip list for at still causes frequent max score computation (although giving the field titleTokenized more weight does reduce the perf hit from -42.0% to -34.3%). I tried to play with some different numLevels and getDocIdUpTo implementations, but so far I find it hard to balance the following two:
- lower docIdUpTo -> lower max score -> more frequent max score computation, but also more effective pruning due to lower max score
- higher docIdUpTo -> higher max score -> less frequent max score computation, but also less effective in pruning due to higher max score
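To make this trade-off concrete, here is a toy sketch (not Lucene code; class and method names are illustrative) of the two ways the per-field docIdUpTo bounds can be aggregated at a given skip level:

```java
import java.util.List;

// Each entry is one field's getDocIdUpTo(level) at the same skip level.
class DocIdUpToChoice {

    // Conservative choice: every field's impacts are valid up to this doc,
    // so the max-score bound is tight (lower), but it expires sooner and
    // must be recomputed more frequently.
    static int minUpTo(List<Integer> upTos) {
        return upTos.stream().mapToInt(Integer::intValue).min().orElse(Integer.MAX_VALUE);
    }

    // Loose choice: the bound stays valid longer (fewer recomputations), but
    // fields whose own upTo is smaller must contribute looser impacts, so the
    // max-score bound is higher and pruning is less effective.
    static int maxUpTo(List<Integer> upTos) {
        return upTos.stream().mapToInt(Integer::intValue).max().orElse(Integer.MIN_VALUE);
    }
}
```

With upTo bounds of 100 (a rare term) and 5000 (a frequent term), minUpTo gives 100 (tight but short-lived) while maxUpTo gives 5000 (long-lived but loose), matching the two bullets above.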
Thanks for running these experiments, the performance does look much better now indeed! I would feel good about merging something that has these performance characteristics.
I assume by "highest weight" here, you meant term that has lower doc frequencies, as opposed to field weight?
I meant field weight. Since we're using impacts from the field that has the highest weight and approximating impacts from other fields, my assumption was that it would help get better max score approximations.
That said, I expect that it would often have the same result as picking the field that has the lowest doc freq, since fields that have higher weight tend to have fewer terms per doc, so doc freqs are generally lower?
result = Math.max(result, impacts.getDocIdUpTo(level));
When it was Math.min, it hurt us because we would have to recompute score boundaries often. But with Math.max we are forcing all fields but one to return impacts on the next level. My gut feeling is that returning the getDocIdUpTo of the field that has the highest weight (or lowest doc freq, as you suggested) would perform better.
Thanks @jpountz for the confirmation and clarification! I gave these ideas a try in commit db2446f, with the following two similar approaches:
- Use the impacts of a) the highest-weight field + b) the lowest-doc-freq term within that field (commented out)
- Use the impacts of a) the highest-weight field + b) the lowest getDocIdUpTo term within that field
Here are the results from python3 src/python/localrun.py -source combinedFieldsUnevenlyWeightedBig
Run 1:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighMed 5.46 (7.7%) 4.58 (7.1%) -16.1% ( -28% - -1%) 0.000
CFQHighLow 21.83 (7.7%) 18.86 (7.2%) -13.6% ( -26% - 1%) 0.000
CFQHighHigh 3.44 (7.4%) 3.06 (7.4%) -11.1% ( -24% - 4%) 0.000
PKLookup 109.63 (11.8%) 109.67 (13.5%) 0.0% ( -22% - 28%) 0.993
Run 2:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighLow 17.22 (10.2%) 11.73 (12.0%) -31.9% ( -49% - -10%) 0.000
CFQHighMed 10.20 (10.5%) 8.36 (9.3%) -18.1% ( -34% - 1%) 0.000
CFQHighHigh 3.65 (6.6%) 3.11 (5.3%) -14.8% ( -25% - -3%) 0.000
PKLookup 91.80 (26.0%) 100.13 (16.3%) 9.1% ( -26% - 69%) 0.187
Run 3:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighMed 5.62 (8.1%) 4.86 (8.0%) -13.5% ( -27% - 2%) 0.000
CFQHighHigh 4.13 (6.9%) 3.59 (5.8%) -13.0% ( -24% - 0%) 0.000
CFQHighLow 17.00 (8.2%) 15.06 (9.7%) -11.4% ( -27% - 7%) 0.000
PKLookup 99.61 (6.4%) 102.96 (6.7%) 3.4% ( -9% - 17%) 0.104
and their CPU profiling looks very similar to that of baseline, suggesting that pruning was not very effective there.
Interestingly, when I flipped the weights assigned to the body and titleTokenized fields in the task (assigning 20.0 to the body field and 2.0 to the titleTokenized field), I got the following result:
Run 1:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 99.64 (9.1%) 94.80 (11.6%) -4.9% ( -23% - 17%) 0.142
CFQHighLow 17.31 (7.2%) 19.16 (15.0%) 10.7% ( -10% - 35%) 0.004
CFQHighMed 3.41 (6.4%) 4.57 (13.6%) 34.2% ( 13% - 57%) 0.000
CFQHighHigh 3.28 (6.3%) 5.32 (16.8%) 62.2% ( 36% - 91%) 0.000
Run 2:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 4.10 (6.3%) 2.72 (5.8%) -33.6% ( -42% - -23%) 0.000
CFQHighMed 5.69 (5.6%) 4.48 (8.2%) -21.4% ( -33% - -8%) 0.000
PKLookup 104.53 (21.1%) 104.06 (16.8%) -0.5% ( -31% - 47%) 0.940
CFQHighLow 22.52 (7.4%) 35.31 (21.3%) 56.8% ( 26% - 92%) 0.000
Run 3:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 80.39 (31.2%) 93.55 (19.4%) 16.4% ( -26% - 97%) 0.046
CFQHighHigh 5.71 (6.6%) 6.96 (13.4%) 21.7% ( 1% - 44%) 0.000
CFQHighMed 10.21 (5.6%) 12.62 (22.0%) 23.6% ( -3% - 54%) 0.000
CFQHighLow 16.89 (4.2%) 27.39 (22.9%) 62.1% ( 33% - 93%) 0.000
I think what might have happened here is: since the highest-weight field / lowest-doc-freq term tends to have fewer terms per doc, by definition its matches are also sparser in the corpus than those of a lower-weight field (i.e. more docs contain a specific term in their body field than in their titleTokenized field). Hence, when returning the getDocIdUpTo of the field that has the highest weight, a larger bound will be returned (probably close to Math.max?), causing the max score to be much higher and less effective for pruning.
What do you think of the commit and results? Is there anything I may have missed?
I re-ran the perf tests combinedFieldsUnevenlyWeightedBig after bug fix 3d0a215, and the results look similar:
With commit "test leadImpact from highest weighted field" (db2446f):
Run 1:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 2.81 (10.9%) 2.40 (10.3%) -14.6% ( -32% - 7%) 0.000
CFQHighMed 3.27 (11.8%) 2.91 (11.0%) -11.2% ( -30% - 13%) 0.002
CFQHighLow 16.09 (14.6%) 14.58 (11.9%) -9.4% ( -31% - 20%) 0.025
PKLookup 92.08 (11.6%) 96.23 (13.3%) 4.5% ( -18% - 33%) 0.255
Run 2:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighLow 15.76 (12.0%) 10.19 (12.9%) -35.3% ( -53% - -11%) 0.000
CFQHighMed 12.01 (13.3%) 8.79 (13.9%) -26.9% ( -47% - 0%) 0.000
CFQHighHigh 5.59 (10.8%) 4.40 (8.4%) -21.2% ( -36% - -2%) 0.000
PKLookup 93.36 (21.9%) 94.29 (22.6%) 1.0% ( -35% - 58%) 0.888
Without commit "test leadImpact from highest weighted field" (db2446f):
Run 1:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighLow 16.69 (7.9%) 15.02 (14.1%) -10.0% ( -29% - 13%) 0.006
PKLookup 94.38 (9.7%) 105.59 (13.3%) 11.9% ( -10% - 38%) 0.001
CFQHighHigh 5.51 (7.0%) 7.59 (13.5%) 37.6% ( 15% - 62%) 0.000
CFQHighMed 10.24 (7.2%) 15.23 (20.0%) 48.8% ( 20% - 81%) 0.000
Run 2:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 94.85 (11.6%) 92.29 (11.0%) -2.7% ( -22% - 22%) 0.448
CFQHighHigh 5.26 (13.7%) 6.25 (23.4%) 18.9% ( -16% - 64%) 0.002
CFQHighMed 3.39 (16.1%) 4.55 (29.7%) 34.3% ( -9% - 95%) 0.000
CFQHighLow 16.04 (14.9%) 29.22 (39.6%) 82.2% ( 24% - 160%) 0.000
Run 3:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 68.96 (40.4%) 67.26 (40.1%) -2.5% ( -59% - 130%) 0.847
CFQHighHigh 4.85 (16.7%) 5.62 (33.8%) 15.7% ( -29% - 79%) 0.062
CFQHighMed 10.67 (20.1%) 12.47 (42.7%) 16.9% ( -38% - 99%) 0.110
CFQHighLow 14.07 (22.6%) 23.18 (68.7%) 64.8% ( -21% - 201%) 0.000
I guess the scoring function has taken care of / bounded the erroneous norm used in the approximation earlier, as it was effectively Long.MIN_VALUE.
Thanks for making progress on this, and sorry for the slow review cycles. I just had a quick look, and it seems that some computations can be moved up-front, such as deciding which field we use as a lead, since field weights do not change during query execution?
I wonder if there are things we can do to make the code simpler, such as sharing the logic that combines multiple terms within the same field with SynonymQuery.
@@ -402,14 +423,30 @@ public Explanation explain(LeafReaderContext context, int doc) throws IOException
  public Scorer scorer(LeafReaderContext context) throws IOException {
    List<PostingsEnum> iterators = new ArrayList<>();
    List<FieldAndWeight> fields = new ArrayList<>();
    Map<String, List<ImpactsEnum>> fieldImpactsEnum = new HashMap<>(fieldAndWeights.size());
    Map<String, List<Integer>> fieldTermDocFreq = new HashMap<>(fieldAndWeights.size());
Do we actually need this list of doc freqs? They would be equal to ImpactsEnum#cost all the time?
Oh, I was actually not aware that ImpactsEnum#cost returns doc freqs directly. Will remove it later.
    maxWeight = weight;
    maxWeightField = field;
  }
}
Since field weights do not change over time, could we compute the field that has the highest weight up-front, instead of doing it every time getImpacts is called?
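What this up-front computation might look like, as a sketch with simplified, non-Lucene types (the class name and map shape are assumptions for illustration):

```java
import java.util.Map;

// Hoist the "highest-weight field" computation out of the per-call path,
// since field weights are fixed for the lifetime of the query.
class FieldWeights {
    private final String maxWeightField;

    FieldWeights(Map<String, Float> fieldAndWeights) {
        String best = null;
        float bestWeight = Float.NEGATIVE_INFINITY;
        for (Map.Entry<String, Float> e : fieldAndWeights.entrySet()) {
            if (e.getValue() > bestWeight) {
                bestWeight = e.getValue();
                best = e.getKey();
            }
        }
        // Computed once at construction, then reused on every getImpacts() call.
        this.maxWeightField = best;
    }

    String maxWeightField() {
        return maxWeightField;
    }
}
```

A per-call loop over fieldAndWeights would repeat this work for every impacts request; with the weights immutable, doing it once in the constructor is equivalent.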
No worries, and thanks for the review and feedback!
Yup, I can move it out to avoid repeated computation there. Just a quick check though: from the perf results so far (#418 (comment) & #418 (comment)), it seems using the most-weighted field to decide on the leading impacts may not yield the best result, as matches in the most-weighted field may also give a high docId upTo bound. If we were to drop that approach and revert the commit db2446f that introduced it, then the most-weighted field wouldn't be computed at all. Are you trying to see if perf tests can be further improved when the most-weighted field only gets computed once?
Yeah, I have the same thought there. I wasn't sure earlier as …
Yes. I liked this approach because it felt like it should work relatively well, since the field with the highest weight should drive scores anyway, and deciding up-front which clause leads impacts felt like it could simplify the logic a bit. But if it doesn't yield better results, let's fall back to the previous approach. Let's maybe just double-check with a profiler that the reason this approach performs worse is actually that we get worse score boundaries, and not some avoidable slow code like iterating over or looking up the various hash maps?
Maybe we could have a …
… replace doc freq collection with ImpactsEnum#cost
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighLow 21.45 (13.8%) 15.65 (18.0%) -27.1% ( -51% - 5%) 0.000
CFQHighMed 5.92 (4.2%) 4.52 (12.2%) -23.7% ( -38% - -7%) 0.000
CFQHighHigh 5.52 (5.0%) 4.45 (12.1%) -19.4% ( -34% - -2%) 0.000
PKLookup 110.45 (11.1%) 108.08 (15.7%) -2.1% ( -26% - 27%) 0.617

= Candidate CPU JFR
PROFILE SUMMARY from 86268 events (total: 86268)
tests.profile.mode=cpu tests.profile.count=30 tests.profile.stacksize=1 tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
15.48% 13355 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer$MultiFieldNormValues#advanceExact()
7.37% 6359 org.apache.lucene.search.DisjunctionDISIApproximation#advance()
7.35% 6345 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
6.38% 5508 org.apache.lucene.search.DisiPriorityQueue#downHeap()
4.18% 3605 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#freq()
4.10% 3538 org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
4.01% 3456 org.apache.lucene.search.DisiPriorityQueue#topList()
3.67% 3169 org.apache.lucene.sandbox.search.CombinedFieldQuery$WeightedDisiWrapper#freq()
3.36% 2899 org.apache.lucene.search.DisiPriorityQueue#top()
3.08% 2660 java.lang.Math#round()
2.40% 2067 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer#getNormValue()
2.38% 2057 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#score()
1.94% 1675 org.apache.lucene.util.SmallFloat#longToInt4()
1.87% 1617 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
1.86% 1605 org.apache.lucene.store.ByteBufferGuard#ensureValid()
1.80% 1557 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.50% 1293 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.39% 1200 org.apache.lucene.search.ImpactsDISI#docID()
1.33% 1145 org.apache.lucene.search.DisiPriorityQueue#updateTop()
1.24% 1067 jdk.internal.misc.Unsafe#getByte()
1.18% 1019 org.apache.lucene.store.ByteBufferGuard#getByte()
0.92% 792 org.apache.lucene.util.SmallFloat#intToByte4()
0.76% 659 java.lang.Math#toIntExact()
0.68% 584 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#freq()
0.67% 579 org.apache.lucene.search.ImpactsDISI#nextDoc()
0.61% 530 org.apache.lucene.search.ImpactsDISI#advanceTarget()
0.58% 500 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
0.57% 492 org.apache.lucene.search.DisiPriorityQueue#prepend()
0.40% 345 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.39% 333 org.apache.lucene.codecs.lucene90.ForUtil#expand8()

= Baseline CPU JFR
PROFILE SUMMARY from 76683 events (total: 76683)
tests.profile.mode=cpu tests.profile.count=30 tests.profile.stacksize=1 tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
16.26% 12468 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer$MultiFieldNormValues#advanceExact()
8.46% 6490 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
8.24% 6315 org.apache.lucene.search.DisjunctionDISIApproximation#nextDoc()
6.13% 4698 org.apache.lucene.search.DisiPriorityQueue#downHeap()
4.79% 3676 org.apache.lucene.search.DisiPriorityQueue#topList()
4.45% 3412 org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
4.32% 3314 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#freq()
3.25% 2492 java.lang.Math#round()
3.01% 2309 org.apache.lucene.search.DisiPriorityQueue#top()
2.75% 2107 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#score()
2.62% 2010 org.apache.lucene.sandbox.search.CombinedFieldQuery$WeightedDisiWrapper#freq()
2.29% 1753 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
2.12% 1623 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer#getNormValue()
2.05% 1569 org.apache.lucene.util.SmallFloat#longToInt4()
1.93% 1481 org.apache.lucene.store.ByteBufferGuard#ensureValid()
1.77% 1358 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
1.45% 1110 org.apache.lucene.search.DisiPriorityQueue#updateTop()
1.25% 958 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#nextDoc()
1.19% 914 org.apache.lucene.store.ByteBufferGuard#getByte()
1.17% 898 java.lang.Math#toIntExact()
1.06% 815 org.apache.lucene.util.SmallFloat#intToByte4()
1.02% 782 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#freq()
0.86% 656 jdk.internal.misc.Unsafe#getByte()
0.68% 518 org.apache.lucene.codecs.lucene90.PForUtil#decodeAndPrefixSum()
0.64% 492 org.apache.lucene.search.DisiPriorityQueue#prepend()
0.49% 375 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.42% 322 org.apache.lucene.codecs.lucene90.ForUtil#expand8()
0.31% 239 java.util.zip.Inflater#inflateBytesBytes()
0.30% 228 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
0.29% 224 org.apache.lucene.codecs.lucene90.PForUtil#expand32()
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 3.68 (6.5%) 2.66 (7.3%) -27.8% ( -39% - -14%) 0.000
CFQHighLow 14.19 (7.9%) 11.31 (8.0%) -20.3% ( -33% - -4%) 0.000
CFQHighMed 12.92 (9.5%) 10.41 (9.9%) -19.4% ( -35% - 0%) 0.000
PKLookup 104.92 (6.7%) 104.49 (9.0%) -0.4% ( -15% - 16%) 0.869

= Candidate CPU JFR
PROFILE SUMMARY from 87201 events (total: 87201)
tests.profile.mode=cpu tests.profile.count=30 tests.profile.stacksize=1 tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
15.13% 13191 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer$MultiFieldNormValues#advanceExact()
7.42% 6466 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
7.29% 6354 org.apache.lucene.search.DisjunctionDISIApproximation#advance()
6.79% 5921 org.apache.lucene.search.DisiPriorityQueue#downHeap()
4.34% 3785 org.apache.lucene.search.DisiPriorityQueue#topList()
3.97% 3465 org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
3.85% 3360 org.apache.lucene.sandbox.search.CombinedFieldQuery$WeightedDisiWrapper#freq()
3.18% 2777 java.lang.Math#round()
3.01% 2624 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#freq()
3.01% 2624 org.apache.lucene.search.DisiPriorityQueue#top()
2.33% 2036 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#score()
2.20% 1917 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer#getNormValue()
2.07% 1809 org.apache.lucene.util.SmallFloat#longToInt4()
1.81% 1580 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
1.78% 1555 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.77% 1545 org.apache.lucene.store.ByteBufferGuard#ensureValid()
1.51% 1316 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.39% 1210 org.apache.lucene.search.DisiPriorityQueue#updateTop()
1.29% 1124 org.apache.lucene.search.ImpactsDISI#docID()
1.22% 1063 jdk.internal.misc.Unsafe#getByte()
1.19% 1037 org.apache.lucene.store.ByteBufferGuard#getByte()
1.08% 938 org.apache.lucene.search.DisiPriorityQueue#prepend()
0.97% 849 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
0.86% 750 org.apache.lucene.codecs.MultiLevelSkipListReader#skipTo()
0.82% 717 org.apache.lucene.util.SmallFloat#intToByte4()
0.78% 683 java.lang.Math#toIntExact()
0.62% 540 org.apache.lucene.codecs.lucene90.PForUtil#decode()
0.61% 529 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#freq()
0.51% 441 org.apache.lucene.search.ImpactsDISI#nextDoc()
0.50% 435 org.apache.lucene.search.ImpactsDISI#advanceTarget()

= Baseline CPU JFR
PROFILE SUMMARY from 81578 events (total: 81578)
tests.profile.mode=cpu tests.profile.count=30 tests.profile.stacksize=1 tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
16.38% 13363 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer$MultiFieldNormValues#advanceExact()
8.67% 7073 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
8.05% 6569 org.apache.lucene.search.DisjunctionDISIApproximation#nextDoc()
6.50% 5305 org.apache.lucene.search.DisiPriorityQueue#downHeap()
5.17% 4220 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#freq()
4.40% 3587 org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
4.30% 3505 org.apache.lucene.search.DisiPriorityQueue#topList()
3.37% 2751 java.lang.Math#round()
2.77% 2263 org.apache.lucene.search.DisiPriorityQueue#top()
2.64% 2156 org.apache.lucene.sandbox.search.CombinedFieldQuery$WeightedDisiWrapper#freq()
2.62% 2135 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#score()
2.12% 1729 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer#getNormValue()
2.11% 1723 org.apache.lucene.util.SmallFloat#longToInt4()
2.06% 1678 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
2.01% 1640 org.apache.lucene.store.ByteBufferGuard#ensureValid()
1.77% 1442 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
1.62% 1325 org.apache.lucene.search.DisiPriorityQueue#updateTop()
1.26% 1028 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#nextDoc()
1.10% 899 java.lang.Math#toIntExact()
1.10% 898 org.apache.lucene.util.SmallFloat#intToByte4()
1.09% 888 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#freq()
1.06% 864 org.apache.lucene.store.ByteBufferGuard#getByte()
0.80% 652 jdk.internal.misc.Unsafe#getByte()
0.74% 605 org.apache.lucene.codecs.lucene90.PForUtil#decode()
0.68% 552 org.apache.lucene.search.DisiPriorityQueue#prepend()
0.60% 492 org.apache.lucene.codecs.lucene90.PForUtil#decodeAndPrefixSum()
0.55% 451 java.io.RandomAccessFile#readBytes()
0.38% 314 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.38% 307 org.apache.lucene.codecs.lucene90.ForUtil#expand8()
0.32% 260 org.apache.lucene.codecs.lucene90.PForUtil#expand32()
Ah yes using
I see. The latest two commits 808fec2 & 75c5b04 used the max-weight field to drive scoring, and overall they had around a -20% impact on tasks. Candidate CPU JFR:
Baseline CPU JFR
As the two profiles look very similar, and based on my previous debugging, this may suggest that the computed max score is high and most docs are not being skipped, so the candidate implementation ends up examining roughly the same number of docs as the baseline.
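For context on the skipping decision discussed here: an impacts-based scorer derives a per-block score upper bound from its (freq, norm) impact pairs and can only skip the block when that bound is below the minimum competitive score. A minimal self-contained sketch of that decision (illustrative only; class and method names and the BM25 parameters are made up, not the actual Lucene implementation):

```java
import java.util.List;

public class MaxScoreSketch {
    // A (freq, norm) pair as carried by an impact; norm here stands in for an
    // illustrative document length, not the encoded byte Lucene stores.
    record Impact(int freq, long norm) {}

    // BM25-like upper bound: the score grows with freq and shrinks with norm,
    // so the best (freq, norm) combination bounds the whole block's score.
    static float maxScore(List<Impact> impacts, float k1, float b, float avgdl, float idf) {
        float max = 0f;
        for (Impact i : impacts) {
            float tf = i.freq() / (i.freq() + k1 * (1 - b + b * i.norm() / avgdl));
            max = Math.max(max, idf * tf);
        }
        return max;
    }

    // The pruning decision: a block may be skipped only when its score upper
    // bound cannot beat the current minimum competitive score.
    static boolean canSkipBlock(List<Impact> impacts, float minCompetitiveScore) {
        return maxScore(impacts, 1.2f, 0.75f, 10f, 2.0f) < minCompetitiveScore;
    }

    public static void main(String[] args) {
        List<Impact> lowImpacts = List.of(new Impact(1, 100));  // rare term, long doc
        List<Impact> highImpacts = List.of(new Impact(50, 10)); // frequent term, short doc
        System.out.println(canSkipBlock(lowImpacts, 1.0f));  // true: bound below threshold
        System.out.println(canSkipBlock(highImpacts, 1.0f)); // false: block must be scored
    }
}
```

If the computed bound stays above the minimum competitive score for most blocks, as suspected in the comment above, nothing gets skipped and the pruning scorer does essentially the same work as exhaustive evaluation.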
 * Merge impacts from multiple impactsEnum (terms matches) within the same field. The high level
 * logic is to combine freqs that have the same norm from impacts.
 */
public static List<Impact> mergeImpactsPerField(
MixedMutabilityReturnType: This method returns both mutable and immutable collections or maps from different paths. This may be confusing for users of the method. (details)
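The merge described in the doc comment above can be sketched outside Lucene. Assuming, for illustration, that the combined freq upper bound for a given norm is the sum of each term's freq at that norm (the real mergeImpactsPerField in this PR operates on per-term impact lists from the index and is more involved), a minimal version could look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ImpactMergeSketch {
    record Impact(int freq, long norm) {}

    // Merge impacts from several terms of the same field: for a given norm,
    // a doc may match every term, so the combined freq upper bound at that
    // norm is the sum of each term's max freq at that norm.
    static List<Impact> mergeImpactsPerField(List<List<Impact>> perTermImpacts) {
        TreeMap<Long, Long> freqByNorm = new TreeMap<>(); // sorted by norm
        for (List<Impact> termImpacts : perTermImpacts) {
            for (Impact impact : termImpacts) {
                freqByNorm.merge(impact.norm(), (long) impact.freq(), Long::sum);
            }
        }
        List<Impact> merged = new ArrayList<>();
        for (var e : freqByNorm.entrySet()) {
            // guard against overflow of the summed freqs
            merged.add(new Impact((int) Math.min(Integer.MAX_VALUE, e.getValue()), e.getKey()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Impact> merged = mergeImpactsPerField(List.of(
                List.of(new Impact(3, 5), new Impact(7, 9)), // impacts of term A
                List.of(new Impact(2, 5))));                 // impacts of term B
        System.out.println(merged); // [Impact[freq=5, norm=5], Impact[freq=7, norm=9]]
    }
}
```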
I just added two logging statements in
and the following query was used:
For the implementation that uses mostly weighted field and lowest cost / doc freq to drive scoring,
For the implementation that uses lowest
These numbers are super interesting. Let's go with the approach of your baseline for now since it performs better; sorry for putting you on a track that performs worse. :) Regarding the new utility class, I was hoping that it could be a higher-level abstraction, e.g. a
No problem! It's actually quite enjoyable for me to explore different approaches and compare their performance characteristics. I may also miss certain aspects, so this discussion helps verify my understanding as well! I've re-implemented the other approach in cbbd28b.
Yeah, I did give that a try earlier, but I ended up with the utility-methods approach as I saw a potential issue. I implemented it again in 502d44e. If my understanding is correct, the intention of using a higher-level abstraction there was to reduce code duplication by using a loop over
Hi @jpountz, just want to circle back to this discussion and see if you have further suggestions on the above analysis?
Hi @jpountz, just want to check again whether I have addressed your concern with the utility approach? To better illustrate the potential issue I see with the higher-level abstraction approach, I've copied over some key code snippets here: If we were to extract out
However, the above
Hence we may actually need to change some APIs if we were to make
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Description
This PR is a work in progress that implements basic dynamic pruning support for CombinedFieldsQuery.
#11099
Tests
Added a randomized test to verify that the top 100 results match between top-scores mode (with pruning) and complete scoring.
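The idea behind that test can be sketched without Lucene: score a corpus once exhaustively and once with threshold-based skipping, and check that both produce the same top-k. (Illustrative only; the PR's actual test goes through IndexSearcher, and all names here are made up.)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class PruningEquivalenceSketch {
    // Exhaustive top-k: look at every score.
    static List<Float> completeTopK(float[] scores, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap of the best k
        for (float s : scores) {
            heap.offer(s);
            if (heap.size() > k) heap.poll();
        }
        List<Float> top = new ArrayList<>(heap);
        top.sort(Comparator.reverseOrder());
        return top;
    }

    // Pruned top-k: skip any doc whose score upper bound (which must be >= its
    // real score) is not competitive with the current k-th best score.
    static List<Float> prunedTopK(float[] scores, float[] upperBounds, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>();
        for (int doc = 0; doc < scores.length; doc++) {
            float minCompetitive = heap.size() == k ? heap.peek() : Float.NEGATIVE_INFINITY;
            if (upperBounds[doc] <= minCompetitive) continue; // dynamic pruning
            heap.offer(scores[doc]);
            if (heap.size() > k) heap.poll();
        }
        List<Float> top = new ArrayList<>(heap);
        top.sort(Comparator.reverseOrder());
        return top;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        float[] scores = new float[10_000];
        float[] bounds = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = r.nextFloat();
            bounds[i] = scores[i] + r.nextFloat() * 0.1f; // bound must dominate the score
        }
        if (!completeTopK(scores, 100).equals(prunedTopK(scores, bounds, 100))) {
            throw new AssertionError("pruned and complete top-100 differ");
        }
        System.out.println("top-100 match");
    }
}
```

As long as the upper bounds dominate the true scores, pruning changes which docs are examined but not the final top-k, which is exactly the invariant the randomized test checks.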